Sparse long blocks and the variance of the LCS

Abstract

Consider two random strings having the same length and generated by two 
mutually independent iid sequences taking values uniformly in a common finite 
alphabet. We study the order of the variance of the longest common subsequence 
(LCS) of these strings when long blocks, or other types of atypical substrings, are 
sparsely added into one of them. We show that the existence of the derivative of 
the mean LCS-curve at its maximum implies that this order is linear in the length 
of the strings. We also argue that our proofs carry over to many models used by 
computational biologists to simulate DNA-sequences.

1 Introduction

Let x and y be two finite strings. A common subsequence of x and y is a string
which is a subsequence of both x and y, while a longest common subsequence (LCS)
is a common subsequence of maximal length. For example, let x = heinrich and let
y = enerico. Then z = ni is a common subsequence of x and y, i.e., the string
ni can be obtained from both x and y by just deleting letters. Common subsequences can
be represented via alignments: the letters which are part of the subsequence
get aligned with each other, while the other letters get aligned with gaps. The subsequence ni
thus corresponds to an alignment with gaps in which only the letter pairs n and i are aligned.

The common subsequence ni is not of maximal length: the LCS is enric, and the
corresponding alignment aligns the five letter pairs e, n, r, i and c.

Often, a long LCS indicates that the strings are related. In this article, we only
consider alignments which align same-letter pairs. Every such alignment defines a common
subsequence, and the length of the subsequence corresponding to an alignment is called
the score of the alignment. An alignment representing an LCS is also called an optimal
alignment (OA). In the above example, the length of the LCS is five, which is denoted by:

\[ |LCS(\text{heinrich}, \text{enerico})| = |LCS(x, y)| = 5. \]
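The example above can be checked with a minimal Python sketch of the standard dynamic programming recursion for the LCS. The function name `lcs` is ours and does not come from the paper; the backtracking step returns one LCS, which here happens to be the only one.

```python
def lcs(x: str, y: str) -> str:
    """Return one longest common subsequence of x and y (standard DP)."""
    n, m = len(x), len(y)
    # table[i][j] = length of an LCS of x[:i] and y[:j]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Backtrack from the corner to recover one optimal subsequence.
    letters = []
    i, j = n, m
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            letters.append(x[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(letters))


print(lcs("heinrich", "enerico"))  # enric
```

The table has (|x| + 1)(|y| + 1) entries, so the cost is quadratic in the string length.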

Longest Common Subsequences (LCS) and Optimal Alignments (OA) are important 
tools used for string matching in Computational Biology and Computational Linguistics 
[H [2U [25] . A main application is to the automatic recognition of related DNA pieces. 
In that context, it is anticipated that if two DNA-strings have a common ancestor, then 
they will have a long LCS. Could it be that by chance (bad luck), unrelated (independent) 
strings have nonetheless a long LCS? How likely is such an event? This, of course, depends 
on the probabilistic model generating the strings. To answer the previous questions, the 
behavior, for n large, of both the expectation $\mathbb{E} LC_n$ and the variance $\operatorname{Var} LC_n$ needs
to be understood. (Throughout, $LC_n$ is the length of the LCS of the random strings
$X = X_1 \cdots X_n$ and $Y = Y_1 \cdots Y_n$.)

The asymptotic behavior of the expectation and the variance of the length of the LCS 
of two independent random strings has been studied by probabilists, physicists, computer 
scientists and computational biologists. It can also be formulated as a last passage per- 
colation problem with dependent weights. The problem of finding the fluctuation order 
for first and last passage percolation has been open for a while. There has been, how- 
ever, a well-known breakthrough for a related problem, that is for the Longest Increasing 
Subsequence (LIS) of a random permutation [5] or of a random word [HI [T71 [23]. For 
the LIS of a random permutation, the order of the fluctuation is the cubic root of the 
expectation and not its square root. For the LCS case, the expectation is of order n, and 
so if the fluctuations were also of the order of the cubic root of the expectation, then $\operatorname{Var} LC_n$ should be of
order $n^{2/3}$. This is the order of magnitude conjectured in [9], for which several heuristic
proofs have been claimed. This conjectured order might seem even more plausible in view
of the recent solution [19], [20] to the Bernoulli matching problem, where the $n^{2/3}$ order is
shown to be correct. We believe this order to be incorrect for the LCS. However, for short
sequences this order might be what one approximately observes in simulations. For the 
LCS-problem, in case of independent iid strings the order of magnitude of the variance 
is, in general, not known. (Except for various cases of binary sequences, see [H], [T5] , 
[18], in which case the variance is asymptotically linear in the length of the strings
considered.) In the present article, we determine the correct asymptotic order of the variance
for the LCS of uniform iid sequences "with artificially added impurities" provided the
mean LCS-curve (see (1.5)) is differentiable at its maximum.

The cubic root claims might arise because the LCS length can be viewed as a last
passage percolation problem with dependent weights: for short sequences, the
dependence in the weights does not have a strong influence, and the LCS then behaves as if
the weights were independent. Let us now explain how the LCS-problem can be viewed
as a last passage percolation problem.

Let the set of vertices be

\[ V := \{0, 1, 2, \ldots, n\} \times \{0, 1, 2, \ldots, n\}, \]

and let the set of oriented edges $\mathcal{E} \subset V \times V$ contain horizontal, vertical and diagonal
edges. The horizontal edges are oriented to the right, while the vertical edges are oriented
upwards, both having unit length. The diagonal edges point up-right at a 45-degree angle
and have length $\sqrt{2}$. Hence,

\[ \mathcal{E} := \{ (v, v + e_1), (v, v + e_2), (v, v + e_3) : v \in V \}, \]

where $e_1 := (1,0)$, $e_2 := (0,1)$ and $e_3 := (1,1)$. With the horizontal and vertical edges,
we associate a weight of 0. With the diagonal edge from $(i, j)$ to $(i+1, j+1)$ we associate
the weight 1 if $X_{i+1} = Y_{j+1}$ and $-\infty$ otherwise. In this manner, we obtain that
$LC_n := |LCS(X_1 X_2 \ldots X_n; Y_1 Y_2 \ldots Y_n)|$ is equal to the total weight of the heaviest path going
from (0,0) to (n,n). Note that the weights in our 2-dimensional graph are not "truly
2-dimensional": they depend only on the one-dimensional sequences $X = X_1 \ldots X_n$
and $Y = Y_1 \ldots Y_n$. In our opinion, this is the reason for the order of magnitude of the
variance of the LCS to be different from other first/last passage-related models.
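The percolation formulation above can be sketched directly in Python: the heaviest path weight is computed vertex by vertex, using weight 0 on horizontal and vertical edges and weight 1 or $-\infty$ on diagonal edges. The function name `heaviest_path_weight` is ours, and, as a convenience not present in the text, the grid is allowed to be rectangular.

```python
NEG_INF = float("-inf")


def heaviest_path_weight(x: str, y: str):
    """Weight of the heaviest oriented path from (0, 0) to (len(x), len(y))."""
    n, m = len(x), len(y)
    best = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:  # horizontal edge, weight 0
                best[i][j] = max(best[i][j], best[i - 1][j])
            if j > 0:  # vertical edge, weight 0
                best[i][j] = max(best[i][j], best[i][j - 1])
            if i > 0 and j > 0:  # diagonal edge: 1 on a match, -inf otherwise
                w = 1 if x[i - 1] == y[j - 1] else NEG_INF
                best[i][j] = max(best[i][j], best[i - 1][j - 1] + w)
    return best[n][m]


print(heaviest_path_weight("heinrich", "enerico"))  # 5
```

The value agrees with the LCS length of the introductory example, as the identification of $LC_n$ with the heaviest path weight requires.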
A subadditivity argument pioneered in [9] shows the existence of the limit

\[ \gamma_k^* := \lim_{n \to \infty} \frac{\mathbb{E} LC_n}{n}, \tag{1.1} \]

where X and Y are two stationary ergodic strings independent of each other and where
the constant $\gamma_k^* > 0$ depends on the distribution of X and Y and on the size k of the
alphabet. Even for the simplest distributions, such as iid strings with binary equiprobable
letters, the exact value of $\gamma_k^*$ is unknown, and extensive simulations have been performed
to obtain approximate values [H 13 HOI El EE2l US] . 

The speed of convergence to the expected length in (1.1) was further determined in
[1], [2], showing that for iid sequences,

\[ \gamma_k^* n - C \sqrt{n \log n} \le \mathbb{E} LC_n \le \gamma_k^* n, \tag{1.2} \]

where C > 0 is a constant depending neither on n nor on the distribution of $X_1$.

As previously mentioned, there exist contradicting conjectures for the order of the
variance of the LCS. Our present result (Theorem 2.1) establishes the order conjectured
in [26] for an iid distribution. We prove it, however, for iid sequences with added impurities
(sparse long blocks or atypical substrings), assuming also that the mean LCS-curve has
a well-defined derivative at its maximum, a condition which has not been proved to hold
in full generality so far. The mean LCS-curve is the rescaled expectation of the LCS
when the two sequences are taken to be of different lengths but in a fixed proportion (see
(1.5)). The impurities, or "long blocks" as we call them, are substrings consisting of only
one symbol, which can be different from block to block. For that model, the variance is
shown to be of order $\Theta(n)$, i.e., there exist two constants $C_2 > C_1 > 0$ independent of
n, such that $C_1 n \le \operatorname{Var} LC_n \le C_2 n$ for all natural numbers n. (Here $LC_n$ is the length
of the LCS of the two independent sequences X and Y of length n, one of which
has sparsely added long blocks.) It is rather interesting that the mere
differentiability of the mean curve at its maximum implies a certain order of magnitude
for the variance. Note that [21], [22] proved that $\operatorname{Var} LC_n \le n$, and so only good lower
bounds for the variance of $LC_n$ are needed. (Simulation studies of the variance are not
that numerous and at times contradict each other.)

Still, for iid sequences with k equiprobable letters, the order of $\operatorname{Var} LC_n$ remains
unknown. We hope nonetheless that similar ideas could be helpful in fully tackling this
problem.

Overview of the main result of this paper 

We first need a few definitions. Let $V_1, V_2, \ldots$ and $W_1, W_2, \ldots$ be two independent iid
sequences with k equiprobable letters, and let

\[ \gamma_k^* := \lim_{n \to \infty} \frac{\mathbb{E} |LCS(V_1 V_2 \ldots V_n, W_1 W_2 \ldots W_n)|}{n}. \]

As already mentioned, the exact value of $\gamma_k^*$ is unknown, but lower and upper bounds are
available, e.g.,

\[
\begin{array}{c|ccc}
k & 2 & 3 & 4 \\ \hline
\gamma_k^* & 0.812 & 0.717 & 0.654
\end{array}
\tag{1.3}
\]

where the precision in the above table is about ±0.01. The expected length of the LCS
of two independent iid sequences both of length n is thus about $\gamma_k^* n$, up to an error term
of order not more than a constant times $\sqrt{n \log n}$ (see (1.2)).

We can also consider two sequences of different lengths, but in such a way that the two
lengths are in a fixed proportion to each other. To do so, let

\[ \gamma_k(n, q) := \frac{\mathbb{E} |LCS(V_1 V_2 \ldots V_{n - nq}, W_1 W_2 \ldots W_{n + nq})|}{n}, \tag{1.4} \]

where $q \in [-1, 1]$, and let

\[ \gamma_k(q) := \lim_{n \to \infty} \gamma_k(n, q), \tag{1.5} \]

which again exists by subadditivity arguments. The function $q \mapsto \gamma_k(q)$ is called the mean
LCS-function. It is symmetric around q = 0 and concave, and it thus has a maximum at
q = 0 which is equal to $\gamma_k^*$. This function corresponds to the wet-region-shape in first
passage percolation.
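To get a feeling for the quantity in (1.4), one can estimate $\gamma_k(n, q)$ by simulation. The sketch below is only illustrative: the function names, the number of trials, the seed and the rounding of $n - nq$ and $n + nq$ to integers are our choices, and for moderate n the estimate at q = 0 sits below the limiting value $\gamma_k^*$, in line with (1.2).

```python
import random


def lcs_length(x, y):
    """Length of an LCS of the sequences x and y (row-by-row DP)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def estimate_gamma(n, q, k=2, trials=20, seed=0):
    """Monte Carlo estimate of gamma_k(n, q) as defined in (1.4)."""
    rng = random.Random(seed)
    len_v = round(n - n * q)  # length of V_1 ... V_{n - nq}
    len_w = round(n + n * q)  # length of W_1 ... W_{n + nq}
    total = 0
    for _ in range(trials):
        v = [rng.randrange(k) for _ in range(len_v)]
        w = [rng.randrange(k) for _ in range(len_w)]
        total += lcs_length(v, w)
    return total / (trials * n)


for q in (-0.5, 0.0, 0.5):
    print(q, estimate_gamma(200, q))
```

Since the LCS is never longer than the shorter string, the estimate at $q = \pm 0.5$ cannot exceed 0.5, which is consistent with the maximum of the curve being at q = 0.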

The fluctuation result (Theorem 2.1) of the present paper shows that the mere existence
of the derivative, at its maximum, of the function $q \mapsto \gamma_k(q)$ implies that

\[ \operatorname{Var} LC_n = \Theta(n), \]

for a model with sparse long blocks added into an iid sequence with k equiprobable letters.
(A block is a maximal contiguous substring consisting of only one symbol.) The model
considered will be described in detail at the beginning of Section 2, but let us, nevertheless,
already give an overview of it. Let $\beta$ and p be reals (independent of n) such that

\[ \frac{1}{2} < \beta < 1 \]

and

\[ 0 < p < 1. \]

Then, take d large but fixed, while n goes to infinity (how large d must be depends
on the choice of $\beta$, and n = 2dm). Now add into the sequence X long blocks of length
about $\ell = d^\beta$. The possible locations for the long blocks are, say, $d, 3d, 5d, \ldots, 2dm - d$.
For each possible location, throw independently the same (possibly
biased) coin to decide whether or not to place a long block there; the probability of
placing a long block in a given location is p. The sequences X and Y both have length n,
and the string Y is iid with k equiprobable letters. Moreover, X is iid with the same k
equiprobable letters except possibly in the places where we have long blocks, and the letters
could be different from long block to long block.

Let us present an example. Take d = 5 and $\ell = 4$, while m = 2. The length of the
sequences X and Y is thus n = 2dm = 20. Consider binary sequences, so that k = 2. We
are thus throwing an unbiased coin independently 20 times to obtain the sequence Y. For
example we could have:

Y = 00101110100011010101.

Then, we throw our unbiased coin again 20 times to obtain the sequence X*. The sequence
X* is thus also uniform iid, and we proceed to add long blocks into X* in order to obtain
the string X. The potential places for long blocks are the integer intervals
$[d(2i-1) - \ell/2, d(2i-1) + \ell/2]$, where i = 1, 2, . . . , m. In the present example, there are only m = 2
such intervals:

[3,7] and [13,17]. (1.6)

Assume that after having thrown our unbiased coin n = 20 times, we obtained for X* the
sequence

X* = 01010100010100101110,

where the substrings in positions 3 to 7 and 13 to 17 could get replaced by long blocks. The next step is to
throw a (possibly biased) coin for each of the intervals which could get a long block. (We
will thus throw the coin m = 2 times.) In this way, we decide for each of the intervals in
(1.6) whether or not there will be a long block covering it. If the corresponding Bernoulli
random variable $Z_i$ is equal to 1, then there will be a long block covering the interval
$[d(2i-1) - \ell/2, d(2i-1) + \ell/2]$, and if it is equal to 0 there will be none. The probability
of a long block is thus equal to $P(Z_i = 1) = p$. In the present example, assume we throw
our coin twice and obtain $Z_1 = 1$ and $Z_2 = 0$. Then in the interval [3, 7] we place a long
block, say made up of zeros, while in [13, 17] we leave things as they are. With these
modifications we obtain:

X = 01000000010100101110.

In other words, to obtain X from X*, we simply fill each integer interval

\[ [d(2i-1) - \ell/2, d(2i-1) + \ell/2] \]

for which $Z_i = 1$ with all the same bits and leave everything else unchanged. In our
example, only the interval [3, 7] is filled (with zeros), so that X differs from X* exactly
in positions 4 and 6.

As already mentioned, the main result of the present paper (Theorem 2.1) is that the
order of magnitude of $\operatorname{Var} LC_n$ is linear, i.e.,

\[ \operatorname{Var} LC_n = \Theta(n), \]

if the mean LCS curve $\gamma_k$ is differentiable at its maximum $\gamma_k^*$. To prove this result, we
take the order of the length of the inserted long blocks larger than $\sqrt{d}$, but smaller than
d. More precisely, we take d and $p \in (0, 1)$ fixed, while n, the common length of X and Y,
goes to infinity. We take a parameter $\beta$ not depending on d such that $1/2 < \beta < 1$, and
set the length of the long blocks to be $\ell = d^\beta$. Our result holds for all d large enough
but fixed (d does not need to be very large for
our fluctuation result to hold). Note also that the length of the long block does not need
to be exactly $\ell$; it could be a little bigger, but this is of no real importance in the present
investigation.
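The passage from X* to X in the example with d = 5, $\ell = 4$ and m = 2 can be reproduced with a few lines of Python. The helper `insert_long_blocks` and its arguments are hypothetical names of ours; positions are 1-indexed as in the text, and $\ell$ is assumed to be even so that $\ell/2$ is an integer.

```python
def insert_long_blocks(x_star: str, d: int, ell: int, z, symbols) -> str:
    """Fill [d(2i-1) - ell/2, d(2i-1) + ell/2] with symbols[i-1] whenever z[i-1] == 1."""
    x = list(x_star)
    for i, (z_i, s) in enumerate(zip(z, symbols), start=1):
        if z_i == 1:
            lo = d * (2 * i - 1) - ell // 2
            hi = d * (2 * i - 1) + ell // 2
            for t in range(lo, hi + 1):  # inclusive integer interval
                x[t - 1] = s             # shift to 0-indexed storage
    return "".join(x)


x_star = "01010100010100101110"
x = insert_long_blocks(x_star, d=5, ell=4, z=[1, 0], symbols=["0", "0"])
print(x)  # 01000000010100101110
```

With $Z_1 = 1$ and $Z_2 = 0$ only the interval [3, 7] is overwritten, which reproduces the string X displayed in the example.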

The present paper was partly motivated by remarks from computational biologists to 
the effect that DNA distribution is not homogeneous. Rather there are different parts, 
with different biological functions (exon, coding parts, non-coding parts, . . . ) . These 
different parts, having different lengths and each having its own distribution, are often 
modeled by computational biologists using hidden Markov chains; the hidden states de- 
termining the parts. Once the hidden states are determined, the DNA-sequence is drawn, 
by using the corresponding distribution for each part. 

The reader might wonder how realistic our present long block model is, in view of this 
hidden-Markov model. Why did we add long blocks in predetermined positions and why 
do they only get added into one sequence and not both? Also, in DNA-sequences there 
are typically no long blocks. Let us present the various restrictions of our model and 
explain which features are present only to simplify the already involved notation, but do 
not represent a fundamental restriction: 

1. The first restriction is that we add long blocks in predetermined locations. This
restriction is only there to simplify notation. The same proof works if we use a
Poisson point process with intensity parameter $\lambda = 1/(2d)$ to determine the locations
of the long blocks. Also, we could take the length of the long blocks to be a geometric
random variable with expectation $\ell = d^\beta$.

2. Another quite unnatural restriction is to put the long blocks only into one sequence.
This is done again to simplify notation. If the starting location of the long blocks
is given by a Poisson point process with intensity $\lambda = 1/(2d)$, then we can add long
blocks in both sequences X and Y. We would then use independent Poisson point
processes with the same intensity for both X and Y. The proofs presented here
work as well for this case.

3. The model with long blocks added in Poisson locations is very similar to a 2-state
hidden Markov chain. For this we could take the hidden states to be L and R. The
state L would correspond to a long block, while R would be the places where the
string is iid. The transition probability from L to R would be $1/d^\beta$, while from R
to L it would be $1/(2d)$. Again, for this hidden 2-state model, proofs very similar to
the ones presented here will give the linear order of the variance, but the notation
would have to become even more cumbersome.

4. In DNA-sequences, there are no long strings consisting of only one symbol. So, the
long block model may at first not look very realistic. However, in place of long
blocks, we can take pieces generated by another distribution. For this we take two
ergodic distributions. Typically one could use a finite Markov chain or a hidden
Markov model with finitely many hidden states. For each different part, we would
use the corresponding distribution. We could use a hidden Markov model first to
determine which positions belong to which part and then fill the part with strings
obtained from the corresponding distribution. (These corresponding distributions
will again typically be hidden Markov with finitely many hidden states or Markov
or finite Markov, maybe even Gibbs.) The hidden states could again be L and R.
(To simplify things here we assume that there are only two DNA-parts.) But this
time the state L would not correspond to a long block. Rather we would have two
stationary, ergodic distributions $\mu_L$ and $\mu_R$. The places with hidden state R would
get the DNA-sequence drawn using $\mu_R$, while for the positions with hidden state L,
we would draw the DNA-sequence from $\mu_L$. The transition probabilities between
L and R would be as before: from L to R it would be $1/d^\beta$, while from R to L it
would be $1/(2d)$. We believe that our current approach to determine the order of the
variance could work for this hidden Markov chain case, provided we had:

\[ \gamma_{L,R}(q) < \gamma_R(q), \tag{1.7} \]

for all q in an appropriate closed interval around 0. Here, $\gamma_{L,R}(q)$ is the coefficient
for the mixed model:

\[ \gamma_{L,R}(q) := \lim_{n \to \infty} \frac{\mathbb{E} |LCS(V_1 V_2 \ldots V_{n - nq}, W_1 W_2 \ldots W_{n + nq})|}{n}, \]

when the string $V_1 V_2 \ldots$ is drawn according to $\mu_L$ and $W_1 W_2 \ldots$ is drawn according
to $\mu_R$. Similarly, $\gamma_R(q)$ is the parameter when both sequences are drawn according
to $\mu_R$:

\[ \gamma_R(q) := \lim_{n \to \infty} \frac{\mathbb{E} |LCS(U_1 U_2 \ldots U_{n - nq}, W_1 W_2 \ldots W_{n + nq})|}{n}, \]

where both strings $U_1 U_2 \ldots$ and $W_1 W_2 \ldots$ are drawn independently from each other
with distribution $\mu_R$. The condition (1.7) makes sense: it seems clear that when
we align two sequences drawn from the same distribution, we should typically get
a longer LCS than if we align sequences drawn from different distributions. This might
be difficult to prove theoretically. (This is why we consider long blocks, since they
make this kind of condition easily verifiable.) Also, we would need $\gamma_R$ to have
a well-defined derivative at all its maximal points. In fact, instead of long blocks,
any atypical long substrings whose asymptotic expected LCS is smaller than
$\gamma_k^*$ will do.

5. A true restriction of our method is that the long blocks are of order greater than
$\sqrt{d}$. Our current methodology does not carry over when this is lacking. It should
be noted, however, that different parts (exon, coding, non-coding part) of DNA are
often pretty long, so the current assumption might not be totally unrealistic. We
do not know how to treat the case when the added long blocks have length below
$\sqrt{d}$.

6. We also assumed that the long blocks have length of order below a constant times d.
For the cases with long blocks of order d times a constant, a different paper would
need to be written. But this seems clearly within reach, considering the present
results.
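The two-state hidden chain of item 3 can be sketched as follows. This is only an illustration under assumptions that are ours and not the paper's: the chain starts in state R, a fresh uniformly chosen block symbol is drawn each time state L is entered, and the function name and parameters are hypothetical.

```python
import random


def sample_hidden_markov_blocks(n, d, beta, k=2, seed=0):
    """Sample (states, letters) of length n from the L/R chain of item 3."""
    rng = random.Random(seed)
    p_leave_block = 1.0 / d ** beta  # transition probability L -> R
    p_enter_block = 1.0 / (2 * d)    # transition probability R -> L
    states, letters = [], []
    state, block_symbol = "R", None  # assumption: start in the iid state R
    for _ in range(n):
        if state == "R":
            if rng.random() < p_enter_block:
                state, block_symbol = "L", rng.randrange(k)
        elif rng.random() < p_leave_block:
            state = "R"
        states.append(state)
        # In state L the block symbol is repeated; in state R letters are iid uniform.
        letters.append(block_symbol if state == "L" else rng.randrange(k))
    return states, letters


states, letters = sample_hidden_markov_blocks(n=2000, d=50, beta=0.75)
print(states.count("L"), "positions lie in long blocks")
```

With these transition probabilities a visit to L lasts $d^\beta$ steps on average and a visit to R lasts 2d steps on average, which mirrors the block length and spacing of the fixed-location model.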

Summarizing: if we are willing to accept that the function $\gamma_R$ has a well-defined derivative
at its maxima and that the condition (1.7) holds, we probably should be able to get

\[ \operatorname{Var} LC_n = \Theta(n), \]

for a whole range of distributions used in practice to model DNA. We plan to investigate
this problem in the future.

Let us briefly describe the content of the rest of the paper. In the next section, the
problem of the order of the variance is first reduced to the biased effect of long blocks.
For this, we choose one long block at random and change it back to iid. Theorem 2.2 then
states that if such a random alteration has typically a sufficiently strong biased effect,
then the linear order of the variance follows. After Theorem 2.2, the rest of the section
is devoted to establishing the biased effect of the long block replacement. It is shown
that from the biased effect for a one-long-block situation (which was established in [3]),
a biased effect follows for the many-long-blocks case.

2 Long blocks and the variance 

In the present section, we consider strings of length n with many long blocks added.
This many-long-blocks model was briefly explained in the introduction and it is precisely
defined now. First, the string $Y = Y_1 \cdots Y_n$ is iid with k equiprobable symbols. Next, d is
taken large but fixed as n goes to infinity and we partition the iid sequence $X^* = X_1^* \ldots X_n^*$
into pieces of length 2d, so that n = 2dm. We then insert, say, in the middle of each of
these pieces at most one long block, deciding at random which pieces get a long block of
length $\ell$ and which do not. In all places where there is no long block, the sequence is iid
with k equiprobable symbols. Let us explain in more detail how this string X is defined.
For each i = 1, 2, . . . , m, let $J_i$ be the interval

\[ J_i := \left[ (2i-1)d - \frac{\ell}{2},\ (2i-1)d + \frac{\ell}{2} \right], \]

i.e., $J_i$ is the i-th place where a long block could be introduced. We assume that
$X^* = X_1^* X_2^* X_3^* \ldots X_n^*$ is iid uniform. The string X is equal to X* everywhere except possibly
at the places where we put long blocks:

\[ X_t := X_t^*, \quad \forall t \in [1, n] - \bigcup_{j=1}^{m} J_j. \]

Let $Z_i$ be the Bernoulli random variable which, when equal to one, places a long block
into the interval $J_i$. Hence,

\[ Z_i = 1 \ \text{ implies } \ X_{j_1} = X_{j_2}, \quad \forall j_1, j_2 \in J_i, \]

and let $Z_1, Z_2, \ldots, Z_m$ be iid Bernoulli random variables with

\[ P(Z_i = 1) = p, \]

where $p \in (0, 1)$. Hence, p is nothing but the probability to have a long block introduced
artificially into one of the possible locations. (The variables $Z_1, Z_2, \ldots, Z_m$ are all
independent of X* and Y, and the string $Y = Y_1 Y_2 \cdots Y_n$ is independent of X and X*.) Moreover,
the strings are drawn from an alphabet $A = \{\alpha_1, \alpha_2, \ldots, \alpha_k\}$ with k equiprobable symbols:

\[ P(X_t = \alpha_j) = P(X_t^* = \alpha_j) = P(Y_t = \alpha_j) = \frac{1}{k} \]

for all t = 1, . . . , n = 2dm and all $j \in \{1, 2, \ldots, k\}$.
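The definition of the many-long-blocks model can be turned into a small sampler. The sketch below is a hedged illustration: the function name and signature are ours, letters are encoded as integers 0, ..., k-1, $\ell$ is assumed even with $\ell/2 < d$, and the symbol of each long block is drawn uniformly, which is one way to keep the marginal probabilities equal to 1/k.

```python
import random


def sample_model(m, d, ell, p, k=2, seed=0):
    """Sample (x_star, z, x, y) for n = 2dm; strings are lists over {0, ..., k-1}."""
    rng = random.Random(seed)
    n = 2 * d * m
    x_star = [rng.randrange(k) for _ in range(n)]
    y = [rng.randrange(k) for _ in range(n)]
    z = [1 if rng.random() < p else 0 for _ in range(m)]
    x = list(x_star)
    for i in range(1, m + 1):
        if z[i - 1] == 1:
            symbol = rng.randrange(k)         # assumption: uniform block symbol
            lo = (2 * i - 1) * d - ell // 2   # left end of J_i (1-indexed)
            hi = (2 * i - 1) * d + ell // 2   # right end of J_i (1-indexed)
            for t in range(lo, hi + 1):
                x[t - 1] = symbol
    return x_star, z, x, y


x_star, z, x, y = sample_model(m=4, d=10, ell=6, p=0.5)
print(z, "".join(map(str, x)))
```

By construction X agrees with X* outside the intervals $J_i$ with $Z_i = 1$, and is constant on each such interval.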

We are now ready to state the main result of the paper. It gives the asymptotic order
of the variance of the LCS of X and Y for the distribution with many long blocks added.
It is valid for any alphabet size k.

Theorem 2.1 Let X and Y be two independent strings of length n = 2dm drawn from
an alphabet with k equiprobable letters, $k \ge 2$. Let Y be iid and let X be a string with
artificially long blocks randomly placed into some of the locations $J_1, J_2, \ldots, J_m$. Each of
the locations has a probability p to receive a long block independently of the others. Outside
the long block areas, the string X is iid. Let the concave function $\gamma_k$ be differentiable at
its maximum. Then there exists $d_1$ such that for all $d \ge d_1$ independent of n,

\[ \operatorname{Var} LC_n = \Theta(n). \]

(Recall that d is held fixed while n goes to infinity.)

For the above theorem to hold, it is enough to show that the change of one long block
(picked at random) induces an expected increase in the LCS. For this we choose in X one
of the long blocks at random and change it back into iid. We assume that all the long
blocks have equal probability to get chosen, and the string obtained by changing one long
block into iid is denoted by $\tilde{X} = \tilde{X}_1 \tilde{X}_2 \ldots \tilde{X}_n$. Let us describe $\tilde{X}$ a little bit more formally.
First recall that X* denotes "the iid string X before the long blocks are introduced." Let
N be the total number of long blocks in X, i.e.,

\[ N := \sum_{i=1}^{m} Z_i, \]

and let $i(j)$ be the index of the j-th long block, i.e., if

\[ \sum_{s=1}^{i-1} Z_s < j \le \sum_{s=1}^{i} Z_s, \]

then $i(j) := i$.

Next, let M be a random variable which is uniform on {1, 2, . . . , r} when N = r:

\[ P(M = j \mid N = r) = \frac{1}{r}, \quad \forall j \le r, \]

and assume that conditionally on N = r, the variable M is independent of X, X* and Y.
The block we change has index $i(M)$, therefore

\[ P(\tilde{X}_s = X_s, \ \forall s \notin J_{i(M)}) = 1 \]

and

\[ P(\tilde{X}_s = X_s^*, \ \forall s \in J_{i(M)}) = 1. \]

In other words, the strings X and $\tilde{X}$ are the same everywhere except on the interval $J_{i(M)}$;
on that interval, $\tilde{X}$ is equal to the iid sequence X*. We are now ready to formulate the
result stating that in order to show that $\operatorname{Var} LC_n = \Theta(n)$, it is enough to prove that the
randomly changed block typically has a positive biased effect on the length of the LCS:

Theorem 2.2 If there exist two constants $c_1, c_2 > 0$ independent of n and d such that

\[ P\left( \mathbb{E}\left[ |LCS(\tilde{X}; Y)| - |LCS(X; Y)| \,\middle|\, X, Y \right] \ge c_1 d^\beta \right) \ge 1 - e^{-c_2 n}, \tag{2.1} \]

for all d large enough (but independent of n), then

\[ \operatorname{Var} LC_n = \Theta(n). \]

Proof. The idea is to represent $LC_n = |LCS(X; Y)|$ as a function f of a binomial random
variable N with f linearly increasing along some scales. Note that if a function $f : \mathbb{R} \to \mathbb{R}$
is such that $f' \ge c > 0$ and if T is a random variable with finite variance, then

\[ \operatorname{Var} f(T) \ge c^2 \operatorname{Var} T. \tag{2.2} \]

Therefore, if N is a binomial random variable with parameters $m = n/(2d)$ and p,

\[ \operatorname{Var} f(N) \ge \frac{c^2 n p (1 - p)}{2d}. \tag{2.3} \]

This last inequality gives the desired order for the variance of f(N), i.e., $\operatorname{Var} f(N) = \Theta(n)$,
and it remains to find a way to represent $LC_n$ as f(N), where f is a function which
typically increases linearly.
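One standard way to see (2.2) and (2.3), sketched here with an independent copy $T'$ of $T$ (a symbol of ours, not used in the paper), relies on the mean value theorem: if $f' \ge c > 0$, then $|f(T) - f(T')| \ge c\,|T - T'|$.

```latex
\begin{align*}
\operatorname{Var} f(T)
  &= \tfrac{1}{2}\, \mathbb{E}\bigl[(f(T) - f(T'))^2\bigr] \\
  &\ge \tfrac{1}{2}\, c^2\, \mathbb{E}\bigl[(T - T')^2\bigr]
   = c^2 \operatorname{Var} T, \\[4pt]
\operatorname{Var} N
  &= m\,p\,(1 - p) = \frac{n\,p\,(1 - p)}{2d}
  \qquad \text{for } N \sim \mathrm{Bin}(m, p),\ m = \frac{n}{2d}.
\end{align*}
```

Combining the two lines with $T = N$ gives the lower bound $c^2 n p (1-p) / (2d)$, which is linear in n for fixed d and p.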

This is done as follows: let $X(\ell)$ denote a string of length n whose distribution is the
distribution of X conditional on the number of long blocks being $\ell$ (in this proof, $\ell$
denotes a number of long blocks rather than the block length):

\[ \mathcal{L}(X(\ell)) = \mathcal{L}(X \mid N = \ell), \]

where $\mathcal{L}$ stands for the law of the corresponding random variables. The strings $X(\ell)$ are
all taken independent of Y and of N. We first simulate X(m). For this, X(m) is a string
of length n with long blocks in every interval $J_i$ for i = 1, 2, . . . , m and iid outside those
intervals. Hence X(m) is the string "with maximum number of long blocks inserted."
Then we obtain X(m - 1) by choosing in X(m) one long block at random and turning it
into iid. Proceeding by induction, once $X(\ell)$ is defined we obtain $X(\ell - 1)$ by choosing
one long block in $X(\ell)$ at random and turning it into iid. Again, all long blocks have the
same probability to get chosen. (We consider only the artificially inserted long blocks,
and not blocks in the iid part which might be long by chance.) Next, let

\[ LC_n(\ell) := |LCS(X(\ell); Y)|. \]

It is easy to notice that, with this construction, $X(\ell)$ has the same distribution as X
conditional on $N = \ell$, and therefore that X(N) has the same distribution as X. So $LC_n$
has the same distribution as $LC_n(N)$ and

\[ \operatorname{Var} LC_n = \operatorname{Var} LC_n(N). \]

Take now $f : \ell \mapsto LC_n(\ell)$. Note that by condition (2.1), $\ell \mapsto LC_n(\ell)$ behaves like
a biased random walk path, which ensures that the function $LC_n(\cdot)$ tends to increase
linearly. Clearly, $\ell \mapsto LC_n(\ell)$ is typically not going to increase at every step but rather
on a log n scale. This is enough to get an inequality like (2.3) by extending techniques
as in [6]. ■
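The coupling used in the proof, in which one starts from X(m) and turns one randomly chosen long block at a time back into iid, can be simulated as follows. This is a sketch under assumptions of ours: the function names are hypothetical, block symbols are drawn uniformly, and $\ell$ denotes the block length in the code (assumed even), not the number of blocks.

```python
import random


def lcs_length(x, y):
    """Length of an LCS of the sequences x and y (row-by-row DP)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def block_removal_path(x_star, y, d, ell, k=2, seed=0):
    """Return [LC_n(m), LC_n(m-1), ..., LC_n(0)] for one realization."""
    rng = random.Random(seed)
    m = len(x_star) // (2 * d)
    intervals = [((2 * i - 1) * d - ell // 2, (2 * i - 1) * d + ell // 2)
                 for i in range(1, m + 1)]
    x = list(x_star)
    for lo, hi in intervals:  # build X(m): a long block in every J_i
        symbol = rng.randrange(k)
        for t in range(lo, hi + 1):
            x[t - 1] = symbol
    remaining = list(range(m))
    path = [lcs_length(x, y)]
    while remaining:
        i = remaining.pop(rng.randrange(len(remaining)))  # uniform choice of a block
        lo, hi = intervals[i]
        x[lo - 1:hi] = x_star[lo - 1:hi]  # turn that block back into iid
        path.append(lcs_length(x, y))
    return path


rng = random.Random(1)
x_star = [rng.randrange(2) for _ in range(200)]
y = [rng.randrange(2) for _ in range(200)]
print(block_removal_path(x_star, y, d=10, ell=6))
```

The last entry of the returned list is the LCS length of X* and Y, since by then every inserted block has been removed.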

So far, we have reduced the problem to the biased effect of our random change. Next
we need to prove that changing a randomly chosen long block into iid has the desired
biased effect. This biased effect is a consequence of [3], which establishes, in Theorems 2.1 and
2.2 there, inequalities such as (2.1) but for strings of length 2d with only one long block.
In other words, in the one long block setting, not having an increase
linear in the length of the block is extremely unlikely as soon as d is not too small, but
still fixed. With this result for one long block, it should not come as a surprise that with
many long blocks, most of them, if changed into iid, lead to an increase of the LCS. To
make this argument rigorous there are two problems we have to overcome:

1. Our result for one long block assumes that both sequences X and Y have length
exactly equal to 2d. An optimal alignment a of sequences of length n will, however,
"map" the pieces

\[ X_1 X_2 X_3 \ldots X_{2d},\ X_{2d+1} X_{2d+2} \ldots X_{4d},\ X_{4d+1} X_{4d+2} \ldots X_{6d},\ \ldots,\ X_{2d(m-1)+1} \ldots X_n \]

to pieces of Y of various lengths.

2. If a is an optimal alignment and say it aligns the piece

\[ X_{2(i-1)d+1} X_{2(i-1)d+2} \ldots X_{2id} \]

with

\[ Y_{j_1} Y_{j_1+1} \ldots Y_{j_2}, \]

then the distribution of $Y_{j_1} Y_{j_1+1} \ldots Y_{j_2}$ is no longer iid but rather complicated and
poorly understood. Our result for the one long block case assumes, however, the
Y-string to be iid.

To solve these two problems, we need a new idea which is introduced next via an example.
Take d = 3 and n = 18, and consider the three intervals:

\[ [1, 2d] = [1, 6], \quad [2d+1, 4d] = [7, 12], \quad [4d+1, n] = [13, 18]. \tag{2.4} \]

We are going to specify the intervals to which these intervals should get aligned. For
example we could align the first with [1, 7], the second with [8, 11] and finally the third
with [12, 18]. Within those constraints, we align in such a way as to get a maximum number
of aligned letter pairs (as usual, we only allow same-letter pairs to be aligned with each
other; thus, one cannot align a letter with a different one). This means that in our current
example, we align a maximum number of letter pairs of $X_1 X_2 \ldots X_6$ and $Y_1 Y_2 \ldots Y_7$. Then
we align a maximal number of letter pairs of $X_7 X_8 \ldots X_{12}$ and $Y_8 \ldots Y_{11}$. Finally we align
$X_{13} \ldots X_{18}$ with $Y_{12} \ldots Y_{18}$ so as to get a maximum number of aligned letter pairs. The
maximum number of aligned letter pairs under these constraints is hence equal to

\[ |LCS(X_1 X_2 \ldots X_6; Y_1 \ldots Y_7)| + |LCS(X_7 X_8 \ldots X_{12}; Y_8 \ldots Y_{11})| + |LCS(X_{13} X_{14} \ldots X_{18}; Y_{12} \ldots Y_{18})|. \]

Note that the three terms in the above sum are independent.
Of course, the alignment defined in this way is not necessarily an alignment
corresponding to an LCS. Indeed, let k = 2, n = 12 and the sequences x = 101010111111
and y = 001010011110. Then the alignment which aligns 101010 with 00 and 111111 with
1010011110 is given by:



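\[
\begin{array}{c|cccccccccccccccc}
x & 1 & 0 & 1 & 0 & 1 & 0 & 1 & - & 1 & - & - & 1 & 1 & 1 & 1 & - \\
y & - & 0 & - & 0 & - & - & 1 & 0 & 1 & 0 & 0 & 1 & 1 & 1 & 1 & 0
\end{array}
\tag{2.5}
\]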



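In fact, it corresponds to two alignments: first the alignment aligning $X_1 X_2 \ldots X_6$ with
$Y_1 Y_2$:

\[
\begin{array}{c|cccccc}
x & 1 & 0 & 1 & 0 & 1 & 0 \\
y & - & 0 & - & 0 & - & -
\end{array}
\tag{2.6}
\]

and then the alignment aligning $X_7 X_8 \ldots X_{12}$ with $Y_3 Y_4 \ldots Y_{12}$:

\[
\begin{array}{c|cccccccccc}
x & 1 & - & 1 & - & - & 1 & 1 & 1 & 1 & - \\
y & 1 & 0 & 1 & 0 & 0 & 1 & 1 & 1 & 1 & 0
\end{array}
\tag{2.7}
\]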



The alignment (2.5) is obtained by "concatenating" the alignments (2.6) and (2.7). The
"score" of the alignment (2.5) is the sum of the scores of the alignments (2.6) and (2.7).
Here, the alignment (2.6) aligned two letter pairs while (2.7) aligned six. Hence, when
X = x and Y = y, the score of the alignment (2.5) is

\[ |LCS(x_1 x_2 \ldots x_6; y_1 y_2)| + |LCS(x_7 x_8 \ldots x_{12}; y_3 y_4 \ldots y_{12})| = 2 + 6 = 8. \]

Now, the alignment corresponding to the LCS is 



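$$
\begin{array}{c|ccccccccccccccc}
x & 1 & 0 & - & 1 & 0 & 1 & 0 & - & 1 & 1 & 1 & 1 & 1 & 1 & - \\
y & - & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 1 & 1 & 1 & 1 & - & - & 0
\end{array}
\tag{2.8}
$$
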
and $LC_n = 9$. In the alignment (2.8), $x_1x_2\dots x_6$ gets aligned with $y_1y_2\dots y_7$.
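The two scores of this example can be checked mechanically. The following is a minimal sketch, assuming the standard quadratic dynamic program for the LCS length; the function name `lcs_length` is ours and does not come from the paper.

```python
def lcs_length(a, b):
    """Length of a longest common subsequence of a and b (standard DP)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


x = "101010111111"
y = "001010011110"

# Constrained alignment: x_1..x_6 with y_1 y_2, then x_7..x_12 with y_3..y_12.
constrained = lcs_length(x[:6], y[:2]) + lcs_length(x[6:], y[2:])
# Unconstrained optimum, i.e. the length of the LCS of x and y.
optimal = lcs_length(x, y)
print(constrained, optimal)  # prints "8 9"
```

The value 9 is also the best possible here for a simple counting reason: $x$ contains only three zeros and $y$ only six ones, so no common subsequence can be longer than $3 + 6$.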

More generally, let $n = 2dm$, and let $0 = r_0 < r_1 < r_2 < \dots < r_{m-1} < r_m = n$ be
integers. Then, we study the best alignment under the following $m$ constraints:

1) $X_1X_2\dots X_{2d}$ gets aligned with $Y_1Y_2\dots Y_{r_1}$

2) $X_{2d+1}X_{2d+2}\dots X_{4d}$ gets aligned with $Y_{r_1+1}Y_{r_1+2}\dots Y_{r_2}$

...

m) $X_{2d(m-1)+1}X_{2d(m-1)+2}\dots X_n$ gets aligned with $Y_{r_{m-1}+1}Y_{r_{m-1}+2}\dots Y_n$.

The score of the best alignment under the above $m$ constraints, denoted by $LC_n(\vec{r}) =
LC_n(r_0, r_1, r_2, \dots, r_m)$, is equal to:

$$
LC_n(\vec{r}) = LC_n(r_0, r_1, r_2, \dots, r_m)
:= \sum_{i=0}^{m-1} |LCS(X_{2di+1}X_{2di+2}\cdots X_{2d(i+1)};\, Y_{r_i+1}Y_{r_i+2}\cdots Y_{r_{i+1}})|.
$$
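To make the definition of the constrained score concrete, here is a small sketch that evaluates it for a given partition. It assumes 0-based Python slicing in place of the 1-based indices of the text; the names `lcs_length` and `constrained_score` are hypothetical helpers, not notation from the paper.

```python
from typing import Sequence


def lcs_length(a, b):
    """Length of a longest common subsequence of a and b (standard DP)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def constrained_score(x: str, y: str, d: int, r: Sequence[int]) -> int:
    """Sum over i of |LCS(X_{2di+1..2d(i+1)}; Y_{r_i+1..r_{i+1}})|.

    r = (r_0, ..., r_m) must satisfy 0 = r_0 < r_1 < ... < r_m = len(y),
    and x must have length 2*d*m.
    """
    m = len(r) - 1
    if len(x) != 2 * d * m:
        raise ValueError("x must have length 2*d*m")
    if r[0] != 0 or r[-1] != len(y):
        raise ValueError("r must start at 0 and end at len(y)")
    if any(r[i] >= r[i + 1] for i in range(m)):
        raise ValueError("r must be strictly increasing")
    return sum(
        lcs_length(x[2 * d * i: 2 * d * (i + 1)], y[r[i]: r[i + 1]])
        for i in range(m)
    )
```

With $x = 101010111111$, $y = 001010011110$ and $d = 3$, the partition $(0, 2, 12)$ gives `constrained_score(x, y, 3, (0, 2, 12)) == 8`, while $(0, 7, 12)$ gives 9, the length of the LCS. Since every constrained alignment is an alignment, the constrained score can never exceed the LCS length.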
Let now

$$
\widetilde{LC}_n(\vec{r}) = \widetilde{LC}_n(r_0, r_1, r_2, \dots, r_m)
:= \sum_{i=0}^{m-1} |LCS(\tilde{X}_{2di+1}\tilde{X}_{2di+2}\cdots \tilde{X}_{2d(i+1)};\, Y_{r_i+1}Y_{r_i+2}\cdots Y_{r_{i+1}})|
\tag{2.9}
$$

denote the score when the sequence $X$ is replaced by the sequence $\tilde{X}$. Let $\mathcal{R}^n$ be the set
of all the partitions of the integer interval $[0, n]$ into $m$ pieces:

$$\mathcal{R}^n := \{(r_0, r_1, \dots, r_m) \in [0, n]^{m+1} : r_0 = 0 < r_1 < r_2 < \dots < r_m = n\},$$



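let $\gamma_k^e$ be a constant independent of $d$ such that

$$\gamma_k^e < \gamma_k^*,$$

and let $0 < q_e < 1$ be the unique real, which exists by concavity, such that

$$\gamma_k(q_e) = \gamma_k^e.$$

Let $\epsilon > 0$ and let

$$\mathcal{R}^n(\epsilon) \subset \mathcal{R}^n$$

be the subset of those elements of $\mathcal{R}^n$ which have more than a proportion $1 - 2\epsilon p$ of the
values $r_i - r_{i-1}$ in the interval

$$\left[\, 2d\,\frac{1 - q_e}{1 + q_e},\; 2d\,\frac{1 + q_e}{1 - q_e} \,\right]. \tag{2.10}$$

More precisely, $(r_0, r_1, \dots, r_m) \in \mathcal{R}^n(\epsilon)$ if and only if

$$\mathrm{Card}\left\{ i \in \{1, \dots, m\} : r_i - r_{i-1} \notin \left[\, 2d\,\frac{1 - q_e}{1 + q_e},\; 2d\,\frac{1 + q_e}{1 - q_e} \,\right] \right\} < 2mp\epsilon.$$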
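The membership criterion for the restricted set of partitions is easy to state in code. The sketch below assumes the interval (2.10) is $[2d(1-q_e)/(1+q_e),\, 2d(1+q_e)/(1-q_e)]$ and that the criterion counts the increments $r_i - r_{i-1}$ falling outside it; the function names `length_interval` and `in_R_eps` are ours.

```python
from typing import Sequence, Tuple


def length_interval(d: int, q_e: float) -> Tuple[float, float]:
    """End points of the interval [2d(1-q_e)/(1+q_e), 2d(1+q_e)/(1-q_e)]."""
    if not 0.0 < q_e < 1.0:
        raise ValueError("q_e must lie strictly between 0 and 1")
    return 2 * d * (1 - q_e) / (1 + q_e), 2 * d * (1 + q_e) / (1 - q_e)


def in_R_eps(r: Sequence[int], d: int, q_e: float, p: float, eps: float) -> bool:
    """True if fewer than 2*m*p*eps of the increments r_i - r_{i-1} lie outside the interval."""
    lo, hi = length_interval(d, q_e)
    m = len(r) - 1
    outside = sum(1 for i in range(1, m + 1) if not lo <= r[i] - r[i - 1] <= hi)
    return outside < 2 * m * p * eps
```

For instance, with $d = 10$ and $q_e = 0.5$ the interval is $[20/3, 60]$, so the partition $(0, 20, 40)$ passes the test for $p = 0.5$, $\epsilon = 0.25$, whereas $(0, 2, 40)$ fails because its first piece has length 2.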



With these notations, we then proceed to prove that with high probability every optimal
alignment is in $\mathcal{R}^n(\epsilon)$.

At this stage the reader might wonder about the significance of the interval (2.10).
The answer is found when we consider two independent iid strings where one has length
$2d$ and the other has any length not in the interval (2.10). Then, the expected length of
the LCS of two such strings is less than or equal to $\gamma_k^e$ times the average of the lengths of the
two strings. (To understand why, recall that the function $\gamma_k$ is concave.) We can now
use this for an alignment $a$ between $X$ and $Y$. Since $\gamma_k^e < \gamma_k^*$, we infer that if too many
of the pieces $X_{2di+1}X_{2di+2}\dots X_{2d(i+1)}$ are to be matched by $a$ with a piece of $Y$ having
length outside (2.10), then the score of the alignment $a$ would, with high probability, be
below optimal. Hence, $a$ would typically not correspond to an LCS. This argument is
made rigorous in the proof of Lemma 2.4.
Denote by $K^n(\epsilon)$ the event that every optimal alignment is in $\mathcal{R}^n(\epsilon)$; in other words,
$K^n(\epsilon)$ holds if and only if

$$\forall\, \vec{r} \in \mathcal{R}^n \text{ such that } LC_n(\vec{r}) = LC_n, \text{ we have } \vec{r} \in \mathcal{R}^n(\epsilon).$$

Let

$$\Delta(\vec{r}) = \Delta(r_0, r_1, \dots, r_m) := \widetilde{LC}_n(r_0, r_1, \dots, r_m) - LC_n(r_0, r_1, r_2, \dots, r_m). \tag{2.11}$$

After proving that $K^n(\epsilon)$ has high probability, we show that, with high probability, every
alignment of $\mathcal{R}^n(\epsilon)$ has a strong conditional increase. To do so, let $Q^n(\epsilon)$ denote the event
that for every alignment of $\mathcal{R}^n(\epsilon)$ the conditional increase due to replacing a long block
by iid is at least $d^{\beta}(\kappa(1 - 2\epsilon) - 2\epsilon)$. More precisely, let

$$Q^n(\vec{r}) := \left\{ E(\Delta(\vec{r}) \mid X, Y) \ge d^{\beta}(\kappa(1 - 2\epsilon) - 2\epsilon) \right\},$$

where $\kappa > 0$ is a constant independent of $n$ and $d$ which will be specified later, and let

$$Q^n(\epsilon) := \bigcap_{\vec{r} \in \mathcal{R}^n(\epsilon)} Q^n(\vec{r}).$$

We will also need an event to ensure that there are enough long blocks. For this, let $O^n$
be the event that there are at least $mp/2$ long blocks, i.e.,

$$O^n := \left\{ \sum_{i=1}^{m} Z_i \ge \frac{mp}{2} \right\}.$$

The events $K^n(\epsilon)$ and $Q^n(\epsilon)$ together imply the desired expected increase. This is the content
of the next lemma.

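Lemma 2.1 On $Q^n(\epsilon) \cap K^n(\epsilon)$,

$$E\left( |LCS(\tilde{X}; Y)| - |LCS(X; Y)| \;\middle|\; X, Y \right) \ge d^{\beta}(\kappa(1 - 2\epsilon) - 2\epsilon).$$

Proof. Let $a$ be an optimal alignment of $X$ and $Y$. If $K^n(\epsilon)$ holds, then $a$ is an alignment
in the set $\mathcal{R}^n(\epsilon)$, and by the very definition of $Q^n(\epsilon)$,

$$E(\Delta(a) \mid X, Y) \ge d^{\beta}(\kappa(1 - 2\epsilon) - 2\epsilon).$$

Therefore,

$$E\left( |LCS(\tilde{X}; Y)| - |LCS(X; Y)| \;\middle|\; X, Y \right) \ge d^{\beta}(\kappa(1 - 2\epsilon) - 2\epsilon). \tag{2.12}$$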



Now the above increase needs to be strictly positive to be of any use. We will see that,
holding $\kappa > 0$ fixed, we can take $\epsilon > 0$ as small as we want and the event $Q^n(\epsilon)$ will still
have almost full probability, as long as $d$ is taken large but fixed.






The bias (2.12) holds when $K^n(\epsilon)$ and $Q^n(\epsilon)$ both hold, therefore

$$
\begin{aligned}
P\Big( E\big( |LCS(\tilde{X}; Y)| - |LCS(X; Y)| \,\big|\, X, Y \big) < d^{\beta}(\kappa(1 - 2\epsilon) - 2\epsilon) \Big) && (2.13) \\
\le P((K^n)^c(\epsilon)) + P((Q^n)^c(\epsilon)) && (2.14) \\
\le P((K^n)^c(\epsilon)) + P((O^n)^c) + P((Q^n)^c(\epsilon) \cap O^n). && (2.15)
\end{aligned}
$$

The purpose of the next three lemmas is to show that the events $(K^n)^c(\epsilon)$, $(O^n)^c$
and $(Q^n)^c(\epsilon) \cap O^n$ hold with small probability.

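Lemma 2.2 Let $k \in \mathbb{N}$, $k \ge 2$ and let $\gamma_k^e < \gamma_k^*$. Let $\epsilon > 0$. Let $0 < p < 1$. Let $d$ be such
that $(1 + \ln 2d)/2d \le (\gamma_k^* - \gamma_k^e)^2 p^2 \epsilon^2 / 32$. Then,

$$P((K^n)^c(\epsilon)) \le \exp\left( -\frac{n (\gamma_k^* - \gamma_k^e)^2 p^2 \epsilon^2}{32} \right)$$

for all $n = 2dm$, $m \in \mathbb{N}$.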



Proof. Let $\vec{r} = (r_0, r_1, r_2, \dots, r_m)$ be an alignment in $\mathcal{R}^n$. Let $LC_n^*(\vec{r})$ denote the
alignment score when we align $X^*$ with $Y$ according to $\vec{r}$:

$$LC_n^*(\vec{r}) := \sum_{i=0}^{m-1} |LCS(X^*_{2di+1}X^*_{2di+2}\cdots X^*_{2d(i+1)};\, Y_{r_i+1}Y_{r_i+2}\cdots Y_{r_{i+1}})|,$$

and let $LC_n^*$ denote the score of the LCS when we align $X^*$ with $Y$:

$$LC_n^* := |LCS(X_1^*X_2^*\dots X_n^*;\, Y_1Y_2\dots Y_n)|.$$

When the alignment $\vec{r}$ does not belong to $\mathcal{R}^n(\epsilon)$, then for $n$ large enough, and as explained
at the end of the present proof,

$$E(LC_n^*(\vec{r}) - LC_n^*) \le -\frac{3}{4}(\gamma_k^* - \gamma_k^e)\, p\epsilon n. \tag{2.16}$$

Next, recall that $LC_n(\vec{r})$ denotes the alignment score when we align $X$ with $Y$ according
to $\vec{r}$, while $LC_n := |LCS(X; Y)|$. The difference between $X^*$ and $X$ is at most $m$ long
blocks of length $d^{\beta}$. Hence the absolute difference between $LC_n^*(\vec{r})$ and $LC_n(\vec{r})$ is at most
$m d^{\beta}$, and so is the absolute difference $|LC_n^* - LC_n|$. Therefore,

$$|(LC_n^*(\vec{r}) - LC_n^*) - (LC_n(\vec{r}) - LC_n)| \le 2 d^{\beta} m. \tag{2.17}$$

But if

$$LC_n(\vec{r}) - LC_n \ge 0 \tag{2.18}$$

is to hold, then, by (2.17), necessarily

$$LC_n^*(\vec{r}) - LC_n^* \ge -2 d^{\beta} m. \tag{2.19}$$
Next, recall that $\beta < 1$, and choose $d$ large enough so that

$$4 d^{\beta} \le (\gamma_k^* - \gamma_k^e)\, p\epsilon d. \tag{2.20}$$

Combining (2.19) with (2.16) and (2.20) leads to:

$$LC_n^*(\vec{r}) - LC_n^* - E(LC_n^*(\vec{r}) - LC_n^*) \ge \frac{1}{2}(\gamma_k^* - \gamma_k^e)\, p\epsilon n, \tag{2.21}$$

and therefore

$$P(LC_n(\vec{r}) - LC_n \ge 0) \le P\left( LC_n^*(\vec{r}) - LC_n^* - E(LC_n^*(\vec{r}) - LC_n^*) \ge \frac{1}{2}(\gamma_k^* - \gamma_k^e)\, p\epsilon n \right). \tag{2.22}$$

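Now, $LC_n^*(\vec{r}) - LC_n^*$ depends on the iid random variables $X_1^*, X_2^*, \dots, X_n^*$ and $Y_1, Y_2, \dots, Y_n$,
and changes by at most 2 when one of these variables changes. Therefore, by Hoeffding's
exponential inequality,

$$P\left( LC_n^*(\vec{r}) - LC_n^* - E(LC_n^*(\vec{r}) - LC_n^*) \ge \frac{1}{2}(\gamma_k^* - \gamma_k^e)\, p\epsilon n \right) \le \exp\left( -\frac{n (\gamma_k^* - \gamma_k^e)^2 p^2 \epsilon^2}{16} \right). \tag{2.23}$$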
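The bounded-difference property used here rests on the fact that changing a single letter of one string changes the LCS length by at most 1, so a difference of two such scores changes by at most 2. The following sketch checks the single-score fact empirically on random binary strings; it is an illustration under our own choice of lengths, alphabet and seed, not part of the proof, and the helper names are hypothetical.

```python
import random


def lcs_length(a, b):
    """Length of a longest common subsequence of a and b (standard DP)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def max_single_change_effect(n: int, alphabet: str, trials: int, seed: int) -> int:
    """Largest observed |LCS(x', y) - LCS(x, y)| when x' differs from x in one position."""
    rng = random.Random(seed)
    worst = 0
    for _ in range(trials):
        x = [rng.choice(alphabet) for _ in range(n)]
        y = [rng.choice(alphabet) for _ in range(n)]
        base = lcs_length(x, y)
        modified = list(x)
        modified[rng.randrange(n)] = rng.choice(alphabet)  # resample one letter
        worst = max(worst, abs(lcs_length(modified, y) - base))
    return worst
```

Whatever the seed, the returned value cannot exceed 1: deleting the changed position costs at most one aligned pair, and the same argument applies in the reverse direction.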

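Note that for $(K^n)^c(\epsilon)$ to hold, we need at least one optimal alignment $\vec{r} \in \mathcal{R}^n$ which is
not in $\mathcal{R}^n(\epsilon)$. But if $\vec{r}$ is optimal then it corresponds to a LCS and thus $LC_n(\vec{r}) - LC_n \ge 0$.
Hence,

$$(K^n)^c(\epsilon) = \bigcup_{\vec{r} \in \mathcal{R}^n \setminus \mathcal{R}^n(\epsilon)} \{ LC_n(\vec{r}) - LC_n \ge 0 \},$$

so that

$$P((K^n)^c(\epsilon)) \le \sum_{\vec{r} \in \mathcal{R}^n \setminus \mathcal{R}^n(\epsilon)} P(LC_n(\vec{r}) - LC_n \ge 0).$$

The last sum above contains at most $\binom{n}{m}$ terms and so, with the help of (2.23), we find

$$
P((K^n)^c(\epsilon))
\le \binom{n}{m} \exp\left( -\frac{n (\gamma_k^* - \gamma_k^e)^2 p^2 \epsilon^2}{16} \right)
\le \left( \frac{ne}{m} \right)^{m} \exp\left( -\frac{n (\gamma_k^* - \gamma_k^e)^2 p^2 \epsilon^2}{16} \right)
= \exp\left( -n \left( \frac{(\gamma_k^* - \gamma_k^e)^2 p^2 \epsilon^2}{16} - \frac{1 + \ln 2d}{2d} \right) \right),
\tag{2.24}
$$

since $n = 2dm$. By our choice of $d$, $(1 + \ln 2d)/2d \le (\gamma_k^* - \gamma_k^e)^2 p^2 \epsilon^2 / 32$, leading to

$$P((K^n)^c(\epsilon)) \le \exp\left( -\frac{n (\gamma_k^* - \gamma_k^e)^2 p^2 \epsilon^2}{32} \right).$$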
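The counting step above uses the standard bound $\binom{n}{m} \le (ne/m)^m$, which for $n = 2dm$ reads $\binom{2dm}{m} \le \exp(m(1 + \ln 2d))$. A short numerical sketch of this bound, working with logarithms to avoid overflow (function names are ours):

```python
import math


def log_binom(n: int, m: int) -> float:
    """Natural logarithm of the binomial coefficient C(n, m)."""
    return math.log(math.comb(n, m))


def log_count_bound(d: int, m: int) -> float:
    """Logarithm of (n e / m)^m with n = 2 d m, i.e. m * (1 + ln(2 d))."""
    return m * (1.0 + math.log(2 * d))


for d in (1, 2, 5, 20):
    for m in (1, 3, 10, 50):
        n = 2 * d * m
        # The number of partitions is at most C(n, m), which is at most (ne/m)^m.
        assert log_binom(n, m) <= log_count_bound(d, m)
```

Dividing the exponent $m(1 + \ln 2d)$ by $n = 2dm$ gives the term $(1 + \ln 2d)/2d$ that the condition on $d$ in Lemma 2.2 is designed to absorb.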
Let us next detail how the inequality (2.16) is obtained. This inequality only holds for
alignments $\vec{r}$ which are not in $\mathcal{R}^n(\epsilon)$. Hence, assume that $\vec{r} = (r_0, r_1, \dots, r_m) \in \mathcal{R}^n$, but
that $\vec{r} \notin \mathcal{R}^n(\epsilon)$. Then, there are at least $2mp\epsilon$ of the substrings

$$Y_{r_i+1}Y_{r_i+2}\cdots Y_{r_{i+1}}, \tag{2.25}$$

having their length not in the interval (2.10). If the string (2.25) has its length outside
(2.10), then, as explained next, the expected length of its LCS with the corresponding piece
of $X^*$ is at most $\gamma_k^e$ times half the number of symbols involved. Thus, if the length of (2.25) is not
in (2.10), then

$$E|LCS(X^*_{2di+1}X^*_{2di+2}\cdots X^*_{2d(i+1)};\, Y_{r_i+1}Y_{r_i+2}\cdots Y_{r_{i+1}})| \le \frac{\gamma_k^e}{2}(2d + r_{i+1} - r_i). \tag{2.26}$$

(To obtain the last inequality, use the fact that the expectation on the left side of (2.26)
is by definition of $\gamma_k(\cdot, \cdot)$ (see (1.4)) equal to $\gamma_k(j, q^*)/2$ times the number of symbols $j$
involved, where $q^* = (r_{i+1} - r_i - 2d)/j$, while $j = 2d + r_{i+1} - r_i$. Moreover, the function
$t \mapsto \gamma_k(t, q^*)$ is subadditive so that

$$\gamma_k(t, q^*) \le \gamma_k(q^*), \tag{2.27}$$

for all $t \in \mathbb{N}$. When $r_{i+1} - r_i$ is outside the interval (2.10), then $q^*$ is outside of $[-q_e, q_e]$.
But since the function $\gamma_k$ is symmetric around the origin and concave,

$$\gamma_k(q^*) \le \gamma_k(q_e) = \gamma_k^e. \tag{2.28}$$

Combining (2.27) and (2.28) leads to

$$\gamma_k(j, q^*) \le \gamma_k^e.$$

This last inequality and the fact that the left side of (2.26) is equal to $\gamma_k(j, q^*)/2$ times
the number of symbols involved jointly imply the inequality (2.26).)
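The link between the length interval (2.10) and the range $[-q_e, q_e]$ for $q^*$ can be verified with exact arithmetic. The sketch below assumes, as in the text, $q^* = (\ell - 2d)/(\ell + 2d)$ for a piece of $Y$ of length $\ell$ and the interval end points $2d(1 \mp q_e)/(1 \pm q_e)$; the function names are hypothetical.

```python
from fractions import Fraction


def q_star(d: int, length) -> Fraction:
    """q* = (length - 2d) / (length + 2d) for a Y-piece of the given length."""
    length = Fraction(length)
    return (length - 2 * d) / (length + 2 * d)


def endpoints(d: int, q_e: Fraction):
    """End points 2d(1-q_e)/(1+q_e) and 2d(1+q_e)/(1-q_e) of the length interval."""
    return 2 * d * (1 - q_e) / (1 + q_e), 2 * d * (1 + q_e) / (1 - q_e)


q_e = Fraction(1, 3)
lo, hi = endpoints(6, q_e)      # lo = 6, hi = 24
assert q_star(6, lo) == -q_e    # lower end point maps to -q_e
assert q_star(6, hi) == q_e     # upper end point maps to +q_e
assert q_star(6, 30) > q_e      # a length above the interval gives q* > q_e
assert q_star(6, 4) < -q_e      # a length below the interval gives q* < -q_e
```

Since $\ell \mapsto (\ell - 2d)/(\ell + 2d)$ is increasing in $\ell$, lengths inside the interval are exactly those with $q^* \in [-q_e, q_e]$.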

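We can now apply a very similar argument for those $i$'s for which $r_{i+1} - r_i$ is in (2.10).
For those $i$'s, instead of inequality (2.26), we find the following:

$$E|LCS(X^*_{2di+1}X^*_{2di+2}\cdots X^*_{2d(i+1)};\, Y_{r_i+1}Y_{r_i+2}\cdots Y_{r_{i+1}})| \le \frac{\gamma_k^*}{2}(2d + r_{i+1} - r_i). \tag{2.29}$$

Let $I$ denote the set of indices $i \in \{0, \dots, m-1\}$ for which $r_{i+1} - r_i$ lies in the interval (2.10).
Combining (2.29) and (2.26), we have:

$$
\begin{aligned}
E\,LC_n^*(\vec{r}) &= \sum_{i=0}^{m-1} E|LCS(X^*_{2di+1}X^*_{2di+2}\cdots X^*_{2d(i+1)};\, Y_{r_i+1}Y_{r_i+2}\cdots Y_{r_{i+1}})| \\
&\le \frac{\gamma_k^e}{2} \sum_{i \notin I} (2d + r_{i+1} - r_i) + \frac{\gamma_k^*}{2} \sum_{i \in I} (2d + r_{i+1} - r_i) \\
&= \frac{\gamma_k^*}{2} \sum_{i=0}^{m-1} (2d + r_{i+1} - r_i) + \left( \frac{\gamma_k^e}{2} - \frac{\gamma_k^*}{2} \right) \sum_{i \notin I} (2d + r_{i+1} - r_i) \\
&\le \frac{\gamma_k^*}{2}(2dm + n) + \left( \frac{\gamma_k^e}{2} - \frac{\gamma_k^*}{2} \right) \sum_{i \notin I} 2d \\
&\le \gamma_k^* n - (\gamma_k^* - \gamma_k^e)\, p\epsilon n.
\end{aligned}
\tag{2.30}
$$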
Next, as $n \to \infty$, $E\,LC_n^*/n \to \gamma_k^*$. So, taking $n$ large enough, we obtain that

$$E\,LC_n^* \ge \gamma_k^* n - \frac{n}{4}(\gamma_k^* - \gamma_k^e)\, p\epsilon. \tag{2.31}$$

In fact, our conditions on $d$ imply that (2.31) is satisfied by (1.2) for all $n = 2dm$.
Combining now (2.30) and (2.31) gives the desired inequality (2.16). ■

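Lemma 2.3

$$P(O^n) \ge 1 - \exp\left( -\frac{mp^2}{2} \right).$$

Proof. The total number of long blocks $\sum_{i=1}^{m} Z_i$ is a binomial random variable with
parameters $m = n/2d$ and $p$. Thus,

$$P\left( \sum_{i=1}^{m} Z_i < \frac{mp}{2} \right) = P\left( \sum_{i=1}^{m} Z_i - E\sum_{i=1}^{m} Z_i < -\frac{mp}{2} \right) \le \exp\left( -\frac{mp^2}{2} \right)$$

by Hoeffding's inequality. ■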
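As a sanity check on this kind of tail estimate, one can compare the exact lower tail of a binomial variable with the classical Hoeffding bound. The sketch assumes the $Z_i$ are iid Bernoulli($p$) and uses the standard form $P(S - ES \le -t) \le \exp(-2t^2/m)$ for a sum $S$ of $m$ independent $[0,1]$-valued variables, which with $t = mp/2$ gives $\exp(-mp^2/2)$; the function names are ours.

```python
import math


def exact_lower_tail(m: int, p: float) -> float:
    """P(Bin(m, p) < m*p/2), computed directly from the binomial pmf."""
    threshold = m * p / 2
    return sum(
        math.comb(m, k) * p**k * (1 - p) ** (m - k)
        for k in range(m + 1)
        if k < threshold
    )


def hoeffding_bound(m: int, p: float) -> float:
    """Hoeffding bound exp(-2 t^2 / m) with t = m*p/2, i.e. exp(-m p^2 / 2)."""
    return math.exp(-m * p * p / 2)


for m in (10, 50, 200):
    for p in (0.1, 0.5, 0.9):
        assert exact_lower_tail(m, p) <= hoeffding_bound(m, p)
```

The bound decays exponentially in $m = n/2d$, which is all that is needed when $d$ is held fixed and $n$ grows.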

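Lemma 2.4 Let $\epsilon > 0$. Let $0 < p < 1$. Let $\eta = 2\beta_1 - 1$, where $1/2 < \beta_1 < \beta < 1$. For $d$
large enough,

$$P((Q^n)^c(\epsilon) \cap O^n) \le \exp\left( -\frac{C m p \epsilon\, d^{\eta}}{4} \right) = \exp\left( -Cn \frac{p\epsilon}{8 d^{1-\eta}} \right),$$

for all $n = 2dm$ and where $C > 0$ is a constant independent of $d$, $n$ and $\epsilon$.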

Proof. We have already shown in [3], in the one long block situation, that changing the
long block into iid tends to increase the LCS-score. For this, see Theorems 2.2 and
2.3 in [3] and the events $O^d$ and $H^d$ there. Now, these results are for the case when the
two strings have length exactly equal to $2d$. However, the same order of magnitude for
the probability holds true if the sequence $Y$ has length close to $2d$ instead of exactly
$2d$, where by close to $2d$ we mean length in the interval (2.10). Thus, for all $d$ and
for all $j$ contained in the interval (2.10), we have

$$
P\left( |LCS(X_1^* \dots X_{2d}^*;\, Y_1 \dots Y_j)| - |LCS(X_1 \dots X_{2d};\, Y_1 \dots Y_j)| \ge \kappa d^{\beta} \;\middle|\; Z_1 = 1 \right)
\ge 1 - \exp(-C d^{2\beta_1 - 1}), \tag{2.32}
$$

where $\kappa > 0$ and $C > 0$ are constants not depending on $d$ or $j$, but depending on the size
of the alphabet, i.e., $k$. The above is obtained, for $j = 2d$, in the proof of Theorems 2.2
and 2.3 in [3]. The same proof holds true for $j$ in the interval (2.10), so we leave it to the
reader.

Now, let $\vec{r}$ be an alignment of $\mathcal{R}^n(\epsilon)$, and let $M^n_{\epsilon}(\vec{r})$ be the event that among the integers
$i = 1, 2, \dots, m$, there are less than $mp\epsilon/2$ of them for which: the length $r_i - r_{i-1}$ is in the
interval (2.10) and there is a long block, but the LCS of $X_{2d(i-1)+1}X_{2d(i-1)+2}\cdots X_{2di}$ with
$Y_{r_{i-1}+1}Y_{r_{i-1}+2}\cdots Y_{r_i}$ does not increase by at least $\kappa d^{\beta}$ when we replace the long block by
iid. Hence, $M^n_{\epsilon}(\vec{r})$ is the event that the set

$$\left\{ i \in \{1, 2, \dots, m\} : Z_i = 1,\; r_i - r_{i-1} \in \left[\, 2d\,\frac{1 - q_e}{1 + q_e},\; 2d\,\frac{1 + q_e}{1 - q_e} \,\right],\; \Delta(\vec{r})_i < \kappa d^{\beta} \right\}$$

contains less than $mp\epsilon/2$ elements. Here,

$$\Delta(\vec{r})_i := |LCS(X^*_{2d(i-1)+1} \cdots X^*_{2di};\, Y_{r_{i-1}+1} \cdots Y_{r_i})| - |LCS(X_{2d(i-1)+1} \cdots X_{2di};\, Y_{r_{i-1}+1} \cdots Y_{r_i})|.$$

Now, the probability for one such substring pair to have its LCS-score not increase by
$\kappa d^{\beta}$ is less than $\exp(-C d^{2\beta_1 - 1})$ when $r_i - r_{i-1}$ is in (2.10) (by inequality (2.32)). This
then leads to:

$$P((M^n_{\epsilon})^c(\vec{r})) \le \binom{m}{mp\epsilon/2} \exp\left( -C d^{2\beta_1 - 1}\, \frac{mp\epsilon}{2} \right). \tag{2.33}$$

Now

$$\binom{m}{mp\epsilon/2} \le \left( \frac{2me}{mp\epsilon} \right)^{mp\epsilon/2} = e^{\frac{mp\epsilon}{2}\left( 1 + \ln \frac{2}{p\epsilon} \right)},$$

and therefore (2.33) becomes

$$P((M^n_{\epsilon})^c(\vec{r})) \le e^{\frac{mp\epsilon}{2}\left( 1 + \ln \frac{2}{p\epsilon} - C d^{2\beta_1 - 1} \right)}. \tag{2.34}$$

Note that the bound on the right side of the last inequality above is exponentially small
in $n$ when $d$ is held fixed while $n$ goes to infinity. (We also need $d^{\beta}$ to be larger than
$2/(Cp\epsilon \log_2 e)$.) Next note that when $M^n_{\epsilon}(\vec{r})$ and $O^n$ both hold, and $\vec{r} \in \mathcal{R}^n(\epsilon)$, then

$$E(\Delta(\vec{r}) \mid X, Y) \ge d^{\beta}(\kappa(1 - 2\epsilon) - 2\epsilon), \tag{2.35}$$

where, as defined before, $\Delta(\vec{r})$ is the change in score of the alignment $\vec{r}$ due to the
modification of a randomly chosen long block into iid. (See (2.11).) The reason for the
last inequality above is as follows: due to $M^n_{\epsilon}(\vec{r})$, there are no more than $mp\epsilon/2$ strings
$X_{2d(i-1)+1}X_{2d(i-1)+2}\cdots X_{2di}$ for which a change of the long block into iid does not create
an increase of at least $\kappa d^{\beta}$ and for which $r_i - r_{i-1}$ is in the interval (2.10). But by $O^n$ there
are more than $mp/2$ long blocks. Each long block has the same probability to get chosen
and changed to iid. Therefore the probability to choose a long block so that the change
does not create the increase we want, but so that the corresponding $r_i - r_{i-1}$ has length in
(2.10), must be less than $(mp\epsilon/2)/(mp/2) = \epsilon$. Similarly, since $\vec{r} \in \mathcal{R}^n(\epsilon)$, there are no
more than $mp\epsilon/2$ of the integers $i \in \{1, 2, \dots, m\}$ for which $r_i - r_{i-1}$ is not in the interval
(2.10). So, the probability to choose a long block in $X_{2d(i-1)+1}X_{2d(i-1)+2}\cdots X_{2di}$ with the
corresponding $r_i - r_{i-1}$ outside (2.10) is also less than or equal to $\epsilon$. Next, we say that the
string $X_{2d(i-1)+1}X_{2d(i-1)+2}\cdots X_{2di}$ is with "defect" when either changing the long block
does not create an increase of at least $\kappa d^{\beta}$ or $r_i - r_{i-1}$ is not in (2.10). Let $T^n(\vec{r})$ be
the event that the long block chosen is not with defect. From our discussion above, we
find that when $O^n$ and $M^n_{\epsilon}(\vec{r})$ both hold and $\vec{r} \in \mathcal{R}^n(\epsilon)$, then

$$P((T^n)^c(\vec{r}) \mid X, Y) \le 2\epsilon. \tag{2.36}$$
Since

$$E(\Delta(\vec{r}) \mid X, Y) = E(\Delta(\vec{r}) \mid X, Y, (T^n)^c)\, P((T^n)^c \mid X, Y) + E(\Delta(\vec{r}) \mid X, Y, T^n)\, P(T^n \mid X, Y), \tag{2.37}$$

and since, by its very definition, when $T^n$ holds $\Delta(\vec{r})$ is greater than or equal to $d^{\beta}\kappa$, it follows
that

$$E(\Delta(\vec{r}) \mid X, Y, T^n) \ge d^{\beta}\kappa. \tag{2.38}$$

Moreover, since only one block of length $d^{\beta}$ is changed, the change in score cannot be
below $-d^{\beta}$, and therefore

$$E(\Delta(\vec{r}) \mid X, Y, (T^n)^c) \ge -d^{\beta}.$$

This last inequality combined with (2.38) and (2.36) in (2.37) yields

$$E(\Delta(\vec{r}) \mid X, Y) \ge -d^{\beta}\, 2\epsilon + \kappa d^{\beta}(1 - 2\epsilon) = d^{\beta}(\kappa(1 - 2\epsilon) - 2\epsilon). \tag{2.39}$$

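Now (2.39) is obtained assuming that $\vec{r} \in \mathcal{R}^n(\epsilon)$ and that $M^n_{\epsilon}(\vec{r})$ and $O^n$ both hold.
By definition, the event $Q^n(\vec{r})$ is equivalent to inequality (2.39) and so, for $\vec{r} \in \mathcal{R}^n(\epsilon)$,

$$O^n \cap M^n_{\epsilon}(\vec{r}) \subset Q^n(\vec{r}),$$

and thus

$$P(O^n \cap (Q^n)^c(\vec{r})) \le P((M^n_{\epsilon})^c(\vec{r})). \tag{2.40}$$

Since

$$(Q^n)^c(\epsilon) = \bigcup_{\vec{r} \in \mathcal{R}^n(\epsilon)} (Q^n)^c(\vec{r}),$$

using (2.40), we get

$$P(O^n \cap (Q^n)^c(\epsilon)) \le \sum_{\vec{r} \in \mathcal{R}^n(\epsilon)} P((M^n_{\epsilon})^c(\vec{r})).$$

Applying (2.34) then gives

$$
\begin{aligned}
P(O^n \cap (Q^n)^c(\epsilon)) &\le \sum_{\vec{r} \in \mathcal{R}^n(\epsilon)} e^{\frac{mp\epsilon}{2}\left( 1 + \ln \frac{2}{p\epsilon} - C d^{2\beta_1 - 1} \right)} \le \binom{n}{m} e^{\frac{mp\epsilon}{2}\left( 1 + \ln \frac{2}{p\epsilon} - C d^{2\beta_1 - 1} \right)} && (2.41) \\
&\le \left( \frac{ne}{m} \right)^{m} e^{\frac{mp\epsilon}{2}\left( 1 + \ln \frac{2}{p\epsilon} - C d^{2\beta_1 - 1} \right)} && (2.42) \\
&\le e^{-m\left( -2 - 2\ln 2d + C \frac{p\epsilon}{2} d^{2\beta_1 - 1} \right)}. && (2.43)
\end{aligned}
$$

Taking $d$ such that

$$8 + 8\ln 2d \le C p\epsilon\, d^{2\beta_1 - 1},$$

and combining it with (2.43), finally leads to

$$P(O^n \cap (Q^n)^c(\epsilon)) \le \exp\left( -C \frac{p\epsilon}{4}\, m d^{\eta} \right),$$

which finishes this proof. ■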

Proof of Theorem 2.1 By Theorem 2.2, in order to prove that $\mathrm{Var}\, LC_n = \Theta(n)$,
it is enough to show that a bias like in (2.1) holds with high probability. Now, the inequality
(2.13) asserts that the probability that the bias fails,

$$P\Big( E\big( |LCS(\tilde{X}; Y)| - |LCS(X; Y)| \,\big|\, X, Y \big) < d^{\beta}(\kappa(1 - 2\epsilon) - 2\epsilon) \Big), \tag{2.44}$$

is bounded above by

$$P((K^n)^c(\epsilon)) + P((O^n)^c) + P((Q^n)^c(\epsilon) \cap O^n). \tag{2.45}$$

Lemmas 2.2, 2.3 and 2.4 imply that the bound (2.45) is exponentially small in $n$. So,
the expected change in (2.44) is larger than $d^{\beta}(\kappa(1 - 2\epsilon) - 2\epsilon)$ with probability close to
one, where "close to one" means up to an exponentially small quantity in $n$. In order to apply
Theorem 2.2, we need a bias larger than $c_1 d^{\beta}$, where $c_1 > 0$ can be any constant
not depending on $n$ and $d$. To achieve this, simply take $\epsilon > 0$ small enough so that

$$\kappa(1 - 2\epsilon) - 2\epsilon > 0, \tag{2.46}$$

e.g., $\epsilon = \kappa/(4(\kappa + 1))$. With this choice of $\epsilon$, the left side of (2.46) is equal to $\kappa/2$, where
$\kappa > 0$ does not depend on $n$ or $d$ and, in turn, the expected conditional increase in (2.44)
is at least equal to $\kappa d^{\beta}/2$ with high probability. Hence, by Theorem 2.2, it follows that
$\mathrm{Var}\, LC_n = \Theta(n)$, for $d$ large enough but fixed. This finishes the proof. ■
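The choice of $\epsilon$ in the last step can be checked with exact rational arithmetic. The sketch below reads the suggested value as $\epsilon = \kappa/(4(\kappa+1))$, which is the reading under which the bias factor equals $\kappa/2$; the function names are ours.

```python
from fractions import Fraction


def bias_factor(kappa: Fraction, eps: Fraction) -> Fraction:
    """The factor kappa*(1 - 2*eps) - 2*eps multiplying d^beta in the bias."""
    return kappa * (1 - 2 * eps) - 2 * eps


def suggested_eps(kappa: Fraction) -> Fraction:
    """The choice eps = kappa / (4*(kappa + 1))."""
    return kappa / (4 * (kappa + 1))


for kappa in (Fraction(1, 10), Fraction(1), Fraction(7, 3)):
    eps = suggested_eps(kappa)
    # kappa - 2*eps*(kappa + 1) = kappa - kappa/2 = kappa/2
    assert bias_factor(kappa, eps) == kappa / 2
```

The identity behind the check is $\kappa(1 - 2\epsilon) - 2\epsilon = \kappa - 2\epsilon(\kappa + 1)$, which equals $\kappa/2$ exactly when $\epsilon = \kappa/(4(\kappa+1))$.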